Abstract: Prior research has focused on intra-domain fault
localization, leaving the cross-domain problem largely unaddressed.
Faults often have widespread effects, which, if correlated,
could  significantly  improve  fault  localization.  Past  efforts  rely  on 
probing  techniques  or  assume  hierarchical  domain  structures; 
however,  administrators  are  often  unwilling  to  share  network 
structure  and  state  and  domains  are  organized  and  connected 
in  complex  ways.  We  present  an  inference-graph-digest  based 
formulation  of  the  problem.  The  formulation  not  only  explicitly 
models  the  inference  accuracy  and  privacy  requirements  for 
discussing  and  reasoning  over  cross-domain  problems,  but  also 
facilitates  the  re-use  of  existing  fault  localization  algorithms 
while  enforcing  domain  privacy  policies.  We  demonstrate  our 
formulation  by  deriving  a  cross-domain  version  of  SHRINK,  a 
recent  probabilistic  fault  localization  strategy. 

I.  INTRODUCTION 

Cross-domain, multi-domain, and inter-domain fault localization
are synonymous terms that describe determining the
root  cause  of  a  network  failure  whose  effects  propagate  across 
administrative  domains.  When  data  is  required  from  more 
than  one  domain  to  isolate  a  fault,  a  cross-domain  solution 
is  needed.  A  study  of  routing  instability  found  that  all  parties 
pointed  to  another  party  as  the  cause  in  about  10%  of  the 
problems  [1]. 

Faults often have widespread effects, which, if correlated,
can significantly increase fault localization accuracy. We define
inference gain to be the increase in inference accuracy
achieved by correlating additional evidence. Cross-domain
network failures cannot always be localized without a coordinated
effort between domains. As an example, consider a
scenario  in  which  an  operator  makes  a  typo  in  the  A  record 
for  a  web  service  in  an  authoritative  DNS  server.  The  domain 
administrator  may  not  be  able  to  isolate  the  fault  quickly,  and 
may  not  even  be  aware  that  a  problem  exists  for  a  period 
of  time.  While  the  fault  remains  unabated  and  potentially 
unnoticed,  there  may  be  observations  external  to  the  domain 
that  could  help  detect  and  localize  the  fault. 

Privacy, scalability, and interoperability issues hinder efforts
to achieve accurate cross-domain fault localization. While
prior work has stated the importance of these issues [1]-[4],
to our knowledge there has been no formal definition
of requirements addressing them. Network domain managers
are often unwilling or not permitted to share detailed internal
network architectures and quality-of-service issues with outside
agencies, which conflicts directly with the need to share data
to successfully troubleshoot networking issues. Automated
techniques for finding faults across a large number of domains
face serious computational issues, and exact computation using
belief networks is NP-hard [5]. Interoperability in network
management and fault isolation techniques is a perennial problem:
different modeling techniques and tools using different
algorithms will be employed in various domains. Conflicts
of information formats and semantics may arise between
domains, with each domain’s model assigning a different value
to the same parameter.

In this paper we characterize the problem space for cross-domain
fault localization and propose an inference graph digest
approach. A cross-domain approach must achieve acceptable
accuracy while satisfying privacy concerns. We illustrate
an approach whereby domain managers using causal graphs
to model fault propagation can use a function to create a
digest representation of their network state and dependencies
to participate in a collaborative effort to localize a cross-domain
fault by sharing observations. By addressing privacy,
scalability,  and  interoperability  issues  with  our  graph  digest 
approach,  we  attack  the  obstacles  that  prevent  collaborative 
cross-domain  fault  localization.  Although  the  discussion  in 
this  paper  focuses  on  the  formulation,  and  demonstrates  the 
utility  of  the  formulation  by  creating  a  causal  graph  digest 
for  a  probabilistic  inference  method,  the  concepts  discussed 
in  creating  a  graph  digest  could  be  extended  to  include  other 
domain  representations  and  inference  methods. 

In Section II we describe related work, including Bayesian
approaches to localize intra-domain faults and approaches
addressing the cross-domain fault localization problem. In Section III
we discuss the challenges and tradeoffs associated
with using a graph digest and formulate an approach to
perform cross-domain fault localization. Section IV illustrates
the graph digest approach for a cross-domain implementation
of SHRINK. We describe our assumptions and initial efforts
to define a graph digest.

II. NETWORK FAULT LOCALIZATION

As  characterized  by  Steinder  and  Sethi,  fault  localization 
is  the  second  step  in  fault  diagnosis  following  fault  detection 
and  preceding  testing  [6]-[8].  Network  administrators  use  fault 
localization  techniques  to  discover  best  hypotheses  explaining 
the observations detected in the fault detection step. Myriad
techniques have been developed for fault localization, including
rule-based systems, model-based systems, case-based
systems, neural networks, decision trees, model traversing
techniques, code-based techniques, Bayesian networks, dependency
graphs, causal graphs, and phrase structured grammars [6].

A  current  trend  attacks  the  problem  by  modeling  network 
dependencies  in  a  directed  acyclic  graph  having  root  causes 
as  parentless  nodes,  observations  as  childless  nodes,  and 
dependencies  represented  as  directed  edges  in  the  graphs  with 
uncertainties  captured  in  conditional  probability  distributions 
associated with each node [5], [9], [10]. This graph structure
is also known as a causal (or causality) graph [6]. Approaches
typically perform probabilistic (Bayesian) inference on bipartite
causal graphs [5], [10].

Root causes of network failures are also known as shared risk
groups (SRGs) [3]. SRGs typically represent hardware components
that can fail, impacting service for a set of dependent
services  or  communication  channels.  In  the  graphical  model, 
edges  depicted  from  an  SRG  to  each  node  directly  influenced 
by  the  state  of  the  SRG  represent  conditional  dependencies.  In 
bipartite  causal  graphs  the  edges  only  connect  from  SRG  nodes 
to  observation  nodes  allowing  faster  probabilistic  inference  as 
compared  with  general  Bayesian  networks. 

The SCORE [9], SHRINK [5], and Sherlock [11] approaches
form the state of the art for leveraging causal graphs
for fault localization. SCORE uses a set covering approach for
finding the best explanation (set of failed SRGs) for observed
outages  based  on  a  bipartite  graph.  SHRINK  enhances  the 
model  to  allow  probabilistic  inference  by  attaching  edge 
weights  that  are  combined  using  the  noisy-OR  [12]  model  to 
form  conditional  probability  tables  for  each  observation  node. 
Sherlock  further  extends  these  approaches  with  a  multilevel 
causal  graph. 

Very little research is published on cross-domain fault localization.
Probing and monitoring techniques can be leveraged
to assist with collection of information about network state and
structure. A cross-domain fault localization approach is presented
by Steinder and Sethi [13] for hierarchically organized
networks.  This  approach  locates  the  source  of  an  end-to-end 
service  failure  through  distributed  coordination  between  the 
domains  along  the  path  of  the  failure.  In  addition  to  an  existing 
domain  hierarchy,  the  approach  relies  on  full  knowledge  of 
each  end-to-end  data  path  at  the  domain  level. 


Fig. 1. Example network: (a) physical topology; (b) IP view.

SHRINK [5] performs Bayesian inference on a bipartite
causal graph. The SHRINK model assumes independent failures
of root cause nodes and that no more than three SRGs
will fail simultaneously in a large network, based on the
extremely low likelihood of four or more simultaneous failures.
Noisy-OR is used to calculate the conditional probability table
for a node with multiple parents. The SHRINK algorithm is
defined as follows. Let $\langle S_1, \ldots, S_n \rangle$ denote a hypothesis
vector, where $S_i = 1$ if a failure of SRG $S_i$ is assumed, and
$S_i = 0$ otherwise. Let $\langle L_1, \ldots, L_m \rangle$ denote an observation
vector, where $L_j = 1$ if a failure of $L_j$ is observed, and
$L_j = 0$ otherwise. Given a particular observation vector, the
SHRINK algorithm searches through all hypothesis vectors
with no more than three assumed failures, and returns those
maximizing the posterior probability:

$$\arg\max_{\langle S_1, \ldots, S_n \rangle} \Pr(\langle S_1, \ldots, S_n \rangle \mid \langle L_1, \ldots, L_m \rangle) \qquad (1)$$

Consider the simple scenario depicted in Figs. 1 and 2.
Fig. 1(a) depicts the network physical topology, in which IP
routers A, B, and C are connected across fibers $F_1$-$F_4$
and optical cross-connects $X_1$ and $X_2$. Each IP router has
a link to each other router, as shown in Fig. 1(b). If any of
the optical components (fibers or optical cross-connects) fail,
the IP routers will detect link failures. The prior SRG failure
probabilities are 10“^ and 10“^ for the fibers and the
cross-connects, respectively.

The causal graph (Fig. 2) has six optical components
mapped to SRGs $S_1$-$S_6$. To account for potential database
and observation errors, a noise value ($10^{-4}$) is subtracted from
the conditional probability of each edge, and noisy edges with
this value are added to form a complete bipartite graph. For example,
$\Pr(L_1 \mid S_1)$ is 0.9999 while $\Pr(L_2 \mid S_1) = 10^{-4}$.

Suppose $L_1$ and $L_2$ are down, and $L_3$ is up. Intuitively, the
cause is most likely the failure of fiber link $F_3$. As described
above, SHRINK only considers hypothesis vectors with at
most three total assumed failures. For this six-SRG example,
SHRINK searches through $\sum_{k=0}^{3} \binom{6}{k} = 42$ hypotheses, with
hypothesis vector $\langle 0,0,0,0,1,0 \rangle$ maximizing the posterior
probability for the given observations. SHRINK correctly
identifies SRG $S_5$ (i.e., the failure of fiber link $F_3$) to be
the root cause.
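
As a minimal sketch of the search just described, the following Python enumerates every hypothesis with at most three failed SRGs and scores it by its prior times its noisy-OR likelihood, a quantity proportional to the posterior of Eq. (1). The dependency set `EDGES` and the prior values in `PRIORS` are illustrative assumptions and are not taken from Figs. 1 and 2; only the noise value implied by the 0.9999 edge probability and the use of index 4 for $S_5$ follow the text.

```python
from itertools import combinations
from math import prod

NOISE = 1e-4  # real edges weigh 1 - NOISE, added noisy edges weigh NOISE

# Hypothetical dependencies: SRG index -> indices of the IP links it affects.
# Index 4 plays the role of S5 and is assumed to carry links L1 and L2 only.
EDGES = {0: {0, 2}, 1: {0, 1, 2}, 2: {1, 2}, 3: {0, 1, 2}, 4: {0, 1}, 5: {2}}
# Hypothetical prior failure probabilities, one per SRG.
PRIORS = [1e-3, 1e-4, 1e-3, 1e-4, 1e-3, 1e-3]


def edge_weight(srg, link):
    return 1 - NOISE if link in EDGES[srg] else NOISE


def likelihood(failed, observed_down):
    """Pr(observations | hypothesis) with noisy-OR at each observation node."""
    result = 1.0
    for link, down in enumerate(observed_down):
        # Probability that no failed SRG brings this link down.
        p_up = prod(1 - edge_weight(srg, link) for srg in failed)
        result *= (1 - p_up) if down else p_up
    return result


def prior(failed):
    """Prior of a hypothesis, assuming independent SRG failures."""
    return prod(p if i in failed else 1 - p for i, p in enumerate(PRIORS))


def shrink(observed_down, max_failures=3):
    """Return the best hypothesis and the number of hypotheses searched."""
    srgs = range(len(PRIORS))
    hypotheses = [frozenset(c) for k in range(max_failures + 1)
                  for c in combinations(srgs, k)]
    scores = {h: prior(h) * likelihood(h, observed_down) for h in hypotheses}
    return max(scores, key=scores.get), len(hypotheses)


best, searched = shrink((True, True, False))  # L1 and L2 down, L3 up
```

Under these assumed values, `best` is the single-failure hypothesis containing index 4 and `searched` is 42. The score is left unnormalized because the normalizing constant is the same for every hypothesis and does not affect the maximization.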

III. FORMULATION OF GRAPH DIGEST APPROACH

In this section, we present a formulation for cross-domain
fault localization based on information-preserving transformations
of intra-domain inference graphs. We propose a set
of criteria to explicitly define the two primary requirements
of cross-domain fault localization: preservation of inference
gain and protection of privacy. Finally, we discuss the main
technical challenges for deriving practical algorithms from the
proposed formulation.


A.  General  Framework 


As discussed above, recent intra-domain approaches use
graphical models to model dependencies between aspects of
network operation, particularly the causal relationships between
hardware failures and observed anomalies. These models
(also called inference graphs) enable inference algorithms
to determine those failure scenarios best explaining observed
anomalies. In practice, faults often propagate across network
domain boundaries, depriving intra-domain algorithms of critical
information required for accurate inference. We address the
problem  by  sharing  summarized  intra-domain  models  (called 
graph  digests)  between  domains.  A  graph  digest  captures 
cross-domain  dependencies  while  hiding  internal  details. 

Our  approach  is  based  on,  and  designed  to  address,  the 
problems  that  arise  from  the  following  assumptions:  domains 
are  administratively  separated,  domain  managers  are  unwilling 
to  reveal  their  internal  network  structures  and  associated 
inference  graphs,  and  finally  domain  managers  are  willing  to 
collaborate  to  localize  faults  if  their  domain’s  internal  details 
are  hidden.  We  believe  these  assumptions  are  fundamental 
constraints  all  general  cross-domain  approaches  must  address. 

A  cross-domain  inference  model  based  on  graph  digests  is 
defined  as  follows.  Consider  n  network  domains: 


• $G_i$ is the inference graph for the $i$th domain.

• $f$ is (ideally) a one-way function on $G_i$ implementing a
privacy policy. $f(G_i)$ is called the inference graph digest,
or simply digest, for $G_i$.

• $\mathcal{G}_j = \left[ \biguplus_{i \neq j} f(G_i) \right] \uplus G_j$, where $j$ is a domain performing
cross-domain inference and $\uplus$ is a model-specific union.
$\mathcal{G}_j$ is the cross-domain model integrating the digests from
all the other domains with domain $j$’s undigested graph.
Now, domain $j$ may use an existing algorithm such as
SHRINK to perform inference over $\mathcal{G}_j$.


Before a practical graph digest approach can be implemented,
interoperability standards must be developed. Domains
using different inference methods can potentially use
a digest approach if standards are implemented and adhered
to. Items to be standardized include data types and attributes as
well as cross-domain management structures (centralized,
distributed, iterative, etc.). We define a shared attribute
as  a  physical  entity  or  logical  concept  modeled  in  two  or 
more  fault  propagation  causal  graphs,  and  that  has  the  same 
semantics  in  each  graph.  In  order  to  create  a  domain  digest  to 
connect  to  another  domain’s  fault  propagation  causal  graph, 
shared  attributes  must  be  identified  and  agreed  upon. 


B.  Modeling  Preservation  of  Inference  Gain 

The function $f$ above is useless if the digest it produces is
not useful for inference. A digest function (transformation) is
inference preserving if it maintains enough structure to allow
successful inference. Ideally, we achieve the same inference
gain using digests as using undigested graphs.

Let $B_u$ and $B_d$ be the best hypotheses produced using
undigested graphs and graph digests, respectively. $B_u$ and
$B_d$ are sets of potential causes. We define the hit ratio $h$,
which measures the percentage of elements in
a hypothesis that are consistent with the observations, and the
coverage ratio $c$, which measures the percentage of all of the
observations that a hypothesis can explain:$^1$

$$h = \frac{|B_d \cap B_u|}{|B_d|}, \qquad c = \frac{|B_d \cap B_u|}{|B_u|}.$$

It is clear that $1 \geq h, c \geq 0$. The ratios of false positives and
false negatives are $1 - h$ and $1 - c$ respectively, both relative
to $B_u$. The ratios $h$ and $c$ can each be easily optimized at the
expense of the other, which may be overcome by computing
the harmonic mean of the two values.$^2$ We propose to use
the harmonic mean $\alpha$ as the criterion to measure how well a
digest model preserves inference gain:

$$\alpha = \begin{cases} 0 & \text{if } h = c = 0 \\ \dfrac{2hc}{h + c} & \text{otherwise} \end{cases}$$

Ideally $\alpha$ is one, when both $h$ and $c$ equal one. The definition
of $\alpha$ can be generalized for the case where more than one best
explanation is derived by the inference algorithm.
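
The criterion above is simple to compute once the two best hypotheses are available as sets. The following Python sketch is one way to do so; the function name `inference_preservation` and the treatment of an empty hypothesis set (its ratio is taken as zero) are assumptions made for illustration.

```python
def inference_preservation(b_d, b_u):
    """Return (h, c, alpha) for digest result b_d and undigested result b_u."""
    common = len(b_d & b_u)
    # Assumption: an empty hypothesis set contributes a ratio of zero.
    h = common / len(b_d) if b_d else 0.0
    c = common / len(b_u) if b_u else 0.0
    # Harmonic mean of h and c, defined as zero when both are zero.
    alpha = 0.0 if h + c == 0 else 2 * h * c / (h + c)
    return h, c, alpha


# Example sets chosen so that h = 1/3 and c = 1/2.
h, c, alpha = inference_preservation({"O2", "S3", "S4"}, {"O2", "F1"})
```

For these example sets the harmonic mean is 0.4, and two identical non-empty sets give a value of one.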


C.  Modeling  Protection  of  Privacy 


We define a sensitive property as a detail the domain manager
considers private. Ideally, a graph digest should not help
to reveal any sensitive properties. Specific sensitive properties
will vary between domains and may include bottlenecks, customer
information, peering agreements, the number of failed
components, and many other characteristics. Furthermore, a
set  of  digests  from  a  domain  collected  over  time  should  not 
aid  in  deriving  the  sensitive  properties  from  the  original  graph. 

Shannon defines “perfect secrecy” as the condition in which
the a priori probability of a message equals the posterior
probability obtained by an adversary deciphering the message
traffic [16]. The same concept can be applied as a criterion for
inference graph privacy. One has to assume that an adversary has
some domain knowledge, has passive access to externally
observable information, and can infer some level of knowledge
about a distribution over time. Using an information theoretic
approach, the relative entropy (Kullback-Leibler distance) [17]
between the probability distribution of the sensitive property
without a digest and the probability distribution after receiving
a digest measures the privacy loss due to sharing a digest. Let
$s$ represent a sensitive property in a domain conditioned by
the adversary’s knowledge, where $s \mid d$ represents the property
further conditioned by a digest. Let $X$ represent the set of
possible values for $s$. The relative entropy equation is:

$$KL(s \mid d, s) = \sum_{x \in X} \Pr(s = x \mid d) \log \frac{\Pr(s = x \mid d)}{\Pr(s = x)} \qquad (3)$$


$^1$ In AI [14], these ratios are known as precision and recall, respectively.
$^2$ The harmonic mean of precision and recall is also known as the F score [15].

Ideally this distance will equal zero for each sensitive
property in a domain, meaning that the information about a
sensitive property is unchanged after receiving a digest. Even
if the entropy is reduced for a sensitive property, the entropy of
$s \mid d$ may remain sufficiently high to protect the privacy of the
property. Ultimately, the resultant entropy of $s \mid d$, and not the
amount of entropy lost, indicates the level of privacy protection
for a sensitive property.
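
To make the two quantities discussed here concrete, the sketch below computes the relative entropy of Eq. (3) and the residual entropy of the conditioned property for discrete distributions. The distributions, the value labels, and the choice of base-2 logarithms are illustrative assumptions; the sketch also assumes the prior assigns non-zero probability to every value the posterior supports.

```python
from math import log2


def relative_entropy(posterior, prior):
    """KL distance (in bits) from the prior to the posterior distribution."""
    # Terms with zero posterior probability contribute nothing to the sum.
    return sum(p * log2(p / prior[x]) for x, p in posterior.items() if p > 0)


def entropy(dist):
    """Shannon entropy (in bits) of a discrete distribution."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)


# Hypothetical adversary beliefs about a sensitive property with four values.
prior = {"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25}
posterior = {"a": 0.5, "b": 0.5, "c": 0.0, "d": 0.0}

loss = relative_entropy(posterior, prior)   # privacy loss from the digest
remaining = entropy(posterior)              # uncertainty the adversary still has
```

In this example the digest costs one bit of privacy and leaves one bit of uncertainty; a digest that leaves the adversary's beliefs unchanged gives a loss of zero.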

Unfortunately,  deriving  accurate  probability  distributions 
about  a  sensitive  property  in  a  domain,  particularly  from  an 
adversary’s  perspective,  may  not  be  possible.  An  ontology 
of  sensitive  properties  with  privacy  protection  implementation 
methods is needed. If prior and posterior probability distributions
can be derived for a sensitive property, we suggest
that the relative entropy of Eq. (3) be used to evaluate the
privacy protection for the property. For sensitive properties that
are  not  conducive  to  evaluation  with  probability  distributions, 
techniques  to  protect  properties  against  obvious  attacks  should 
be  implemented.  As  an  example,  reachability  information  may 
be  difficult  to  hide  with  a  relative  entropy  approach,  but  has  a 
well  established  method  for  evaluation:  the  transitive  closure 
of  an  adjacency  matrix  determines  the  reachability  between 
any  two  components  in  a  network.  A  digest  must  prevent  this 
obvious  attack  method  by  denying  adjacency  information  that 
could be used to establish reachability. A function $f$ could
be augmented to break all paths between nodes that would
reveal the sensitive property (or add nodes to hide a lack of
reachability). We evaluate privacy against the transitive closure
attack for reachability in Section IV.C.

IV. ILLUSTRATION OF DIGEST APPROACH

In  this  section,  we  demonstrate  the  utility  of  our  formulation 
by  showing  how  a  digest  for  a  bipartite  causal  graph  can 
be  created  to  enable  the  use  of  SHRINK  for  cross-domain 
fault  localization.  We  selected  SHRINK  to  illustrate  the  ideas 
presented  in  Section  III  due  to  its  robustness  and  simplicity. 
We  first  present  an  algorithm  to  create  a  digest  for  inference  by 
SHRINK. Next, we describe a hypothetical cross-domain scenario
to illustrate the possible steps of creating such a digest.
Finally, we provide a brief analysis of inference preservation
and privacy protection for the scenario.

A.  Bipartite  causal  graph  digest  creation 

A main challenge for implementing the digest approach is
to find digest creation algorithms that, to be practical, must
be general while at the same time meeting a wide range of
domain-specific information hiding needs. We present one such
algorithm, createBipartiteDigest (Fig. 3), that, although a
bit naive, embodies the concepts presented in Section III. We
crafted createBipartiteDigest to establish a cross-domain
extension to SHRINK, and thus the assumptions previously
presented for SHRINK in Section II apply to this algorithm as
well. The algorithm executes in a sequential four-step process:
(1) pruning (lines 3, 4), (2) partial evaluation (lines 5-8), (3)
aggregation (lines 10-12), then (4) renaming (lines 13, 14). The
pruning step removes any SRG nodes that have no non-noisy
edges to observation nodes reporting failure. Except for highly
unlikely cases, these nodes will have a very low score and will
not appear on a list of best explanations.

createBipartiteDigest(G)
 1: Add node L_new to G
 2: for all SRG S_i ∈ G
 3:   if (for all edges (S_i, L_j) ∈ G, L_j is up)
 4:   then Prune S_i and its edges (S_i, L_j)
 5:   else
 6:     Collect edges (S_i, L_j) ∈ G such that L_j is up
 7:     Add edge (S_i, L_new) using Eq. (4) on collected edges
 8:     Prune collected edges (S_i, L_j)
 9: Remove all isolated observation nodes L_i
10: for all SRG S_x, S_y ∈ G
11:   if S_x and S_y are indistinguishable
12:     Aggregate S_x and S_y into S_z such that S_z = S_x ∪ S_y
13: Rename all SRGs that are not shared attributes
14: Rename all observation nodes other than L_new

Fig. 3. Algorithm for computing a digest from a bipartite causal graph G.

Fig. 4. Physical topology of considered scenario.

The partial evaluation step uses noisy-OR to combine all edges
from an SRG $S_i$ that point to $k$ observation nodes
$L_1, L_2, \ldots, L_k$ reporting liveness into a single edge to
the node $L_{new}$. The noisy-OR equation to compute
the new edge weight is:

$$\Pr(L_{new} \mid S_i) = 1 - \prod_{j=1}^{k} \left(1 - \Pr(L_j \mid S_i)\right) \qquad (4)$$

The aggregation step of the algorithm combines SRGs that
have the same prior probabilities and edges. These SRGs
will have identical scores on a list of best explanations.
Aggregation of SRGs means that one SRG represents a set
of SRGs. The final step, renaming, simply assigns a new
label to each node in the resultant graph, except for shared
attributes and $L_{new}$.
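
The four steps can be sketched in Python as follows. This is an illustrative reading of Fig. 3, not a reference implementation: the dictionary-based graph representation, the opaque labels `U1`, `U2`, ... and `O1`, `O2`, ..., the decision never to merge shared attributes, and the node names in the example are all assumptions. The returned `members` mapping is meant to stay private to the digesting domain so that it can translate inference results back to its own SRGs.

```python
from collections import defaultdict
from math import prod

L_NEW = "L_new"


def create_bipartite_digest(priors, edges, down, shared):
    """priors: {srg: prior}; edges: {(srg, obs): weight}, non-noisy edges only;
    down: {obs: True if it reports failure}; shared: names of shared SRGs."""
    kept, new_edges = {}, {}
    for srg, prior in priors.items():
        mine = {obs: w for (s, obs), w in edges.items() if s == srg}
        # Pruning (lines 3-4): skip SRGs with no edge to a failed observation.
        if not any(down[obs] for obs in mine):
            continue
        kept[srg] = prior
        # Partial evaluation (lines 5-8): noisy-OR of edges to "up" nodes.
        up_weights = [w for obs, w in mine.items() if not down[obs]]
        if up_weights:
            new_edges[(srg, L_NEW)] = 1 - prod(1 - w for w in up_weights)
        new_edges.update({(srg, obs): w for obs, w in mine.items() if down[obs]})
    # Line 9 is implicit: observation nodes exist only as edge endpoints.

    # Aggregation (lines 10-12): group SRGs with equal priors and edges.
    # Assumption: shared attributes are never merged, so each keeps its name.
    groups = defaultdict(list)
    for srg, prior in kept.items():
        own = frozenset((obs, round(w, 12))
                        for (s, obs), w in new_edges.items() if s == srg)
        groups[("shared", srg) if srg in shared else (prior, own)].append(srg)

    # Renaming (lines 13-14): opaque labels except shared attributes and L_new.
    obs_names = {L_NEW: L_NEW}
    d_priors, d_edges, members = {}, {}, {}
    counter = 0
    for group in groups.values():
        first = group[0]
        if first in shared:
            name = first
        else:
            counter += 1
            name = f"U{counter}"
        members[name] = group
        d_priors[name] = kept[first]
        for (s, obs), w in new_edges.items():
            if s == first:
                label = obs_names.setdefault(obs, f"O{len(obs_names)}")
                d_edges[(name, label)] = w
    return d_priors, d_edges, members


# Hypothetical domain graph: L4 is down, all other observations are up.
w = 0.9999
priors = {name: 1e-3 for name in ("C1", "R1", "R4", "P2", "P3")}
edges = {("C1", "L4"): w, ("R1", "L1"): w, ("R1", "L2"): w,
         ("R4", "L4"): w, ("R4", "L1"): w, ("R4", "L3"): w,
         ("P2", "L4"): w, ("P2", "L2"): w, ("P3", "L4"): w, ("P3", "L2"): w}
down = {"L1": False, "L2": False, "L3": False, "L4": True}
d_priors, d_edges, members = create_bipartite_digest(priors, edges, down, {"C1"})
```

In this example `R1` is pruned because it only touches live observations, `P2` and `P3` become one aggregated node, and the shared attribute `C1` keeps its name so that another domain can attach its own graph to it.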

B.  Cross-domain  scenario 

Three months ago Blue Inc. (Domain 2) started a lease for
three optical circuits across the optical mesh provided by Red
Inc. (Domain 1), a large provider with many customers. The
physical view of the overlap between Blue and Red is depicted
in Fig. 4 and the view of provisioning in Fig. 5. Redundant
components in the mesh (not portrayed) provided by Red were
tested to ensure that Blue could transit the mesh in the event
of a failover. Two months ago Green Inc. subscribed to a
number of circuits across the Red mesh. Through oversight,
an older image was used to add the circuits for Green Inc.
to the tables in the backup Optical Digital Cross Connect
(ODCX) $O_2$. The backup ODCX $O_2$ has no records honoring
the leased circuits for Blue. This morning, $O_2$ went offline
and the backup component came online, severing connectivity
for Blue across the mesh.

Fig. 5. View of leased circuits provisioned to Domain 2.

Fig. 7. Domain 2 reflecting down state of $L_4$, $L_5$, and $L_6$.

Conducting  inference  in  isolation,  neither  Red  nor  Blue  is 
able  to  isolate  the  problem.  From  the  perspective  of  Red,  all 
tools  show  a  healthy  network  status  and  the  circuits  show 
liveness.  From  the  perspective  of  Blue,  no  traffic  can  cross 
the  leased  circuits. 

Fig. 6 portrays the IP link connectivity for Blue. The
administrators at Blue use the SHRINK algorithm for fault
localization, and Fig. 7 reflects the graph with the three failed
IP links highlighted. The optical circuits are shared attributes,
which Blue has modeled as SRGs $C_1$-$C_3$. Not knowing how
the optical mesh is configured, Blue has assigned a uniform
prior probability of failure of 10“^ for each SRG in the graph.
Inference for the best explanation returns $\{R_4\}$, $\{C_1\}$,
$\{C_2\}$, and $\{C_3\}$ as equally likely.

Red agrees to perform probabilistic inference on a combined
inference graph using a digest from Blue. Blue has one
sensitive property to hide from Red: the internal reachability
between $R_4$ and $R_5$. Blue is interested in hiding whether
they are able to transmit data between $R_4$ and $R_5$ if circuits
$C_1$-$C_3$ fail. The digest construction proceeds with pruning
(Fig. 8), partial evaluation (Fig. 9), aggregation (Fig. 10), and
renaming (Fig. 11).

Fig. 9. Domain 2 partial evaluation by combining all IP link nodes reporting
liveness into $L_u$ using noisy-OR.

Fig. 10. Domain 2 reflecting aggregation of $R_2$, $R_6$ into $U_1$, and $P_3$, $P_5$
into $U_2$.

Fig. 11. Completed Domain 2 digest. Internal SRGs renamed.

Red creates a union of their graph with the digest from
Blue. Red also uses SHRINK for inference, so the combined
graph must be converted to a bipartite graph. The provided
circuits $C_1$, $C_2$, and $C_3$ are logical in nature and, although
they are needed to connect the graphs, they are not needed
in the final graph for inference. Each edge in the Red portion
of the graph that is directed to one of the circuit nodes is
redirected to each observation node that the circuit node is
directed to, and all of the shared attribute nodes are then
pruned from the graph. The final causal graph is shown in
Fig. 13. Red uniformly assigns a prior probability of 10“^ to each
SRG. SHRINK inference run on the graph returns $B_d = \{O_2\}$
as the best explanation, with a score significantly higher than
the other hypotheses. Convinced that the lost connectivity for
Blue is most likely caused by $O_2$, Red proceeds to diagnose
the ODCX and restore service to Blue.
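
The conversion to a bipartite graph described above can be sketched structurally as follows. The sketch tracks only which edges exist, since the text does not specify how weights are carried over when an edge is redirected; the function name `splice_shared_attributes` and all node names in the example are hypothetical.

```python
def splice_shared_attributes(local_edges, digest_edges, shared):
    """Join a local graph with a digest and remove the shared attribute nodes.

    local_edges and digest_edges are sets of (source, target) pairs; shared is
    the set of shared attribute names that appear in both graphs.
    """
    # Observation nodes that each digest node points to.
    children = {}
    for src, dst in digest_edges:
        children.setdefault(src, set()).add(dst)

    # Keep digest edges whose source is not a shared attribute.
    merged = {(src, dst) for src, dst in digest_edges if src not in shared}
    for src, dst in local_edges:
        if dst in shared:
            # Redirect the edge to every observation the shared node reaches.
            merged |= {(src, obs) for obs in children.get(dst, ())}
        else:
            merged.add((src, dst))
    return merged


# Hypothetical provider edges into three circuits, and a hypothetical digest.
red_edges = {("O2", "C1"), ("O2", "C2"), ("O2", "C3"), ("X1", "C1")}
blue_digest = {("C1", "D1"), ("C2", "D2"), ("C3", "D3"),
               ("U1", "D1"), ("U1", "L_u")}
final_edges = splice_shared_attributes(red_edges, blue_digest,
                                       {"C1", "C2", "C3"})
```

After the splice, every edge runs from an SRG directly to an observation node, so an algorithm that expects a bipartite graph can be applied to `final_edges` once edge weights are assigned.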

C.  Analysis  of  criteria 

Running SHRINK on the union of the Red graph with the
undigested Blue graph returns $B_u = \{O_2\}$ as the best explanation
with a score significantly higher than the other hypotheses.
The hit ratio $h$ and coverage ratio $c$ are both 1.0, resulting in
a perfect inference preservation score ($\alpha = 1.0$). To further
illustrate the $\alpha$ criterion, suppose instead the inference results
received were $B_d = \{O_2, S_3, S_4\}$ and $B_u = \{O_2, F_1\}$. For
this scenario $h = 0.33$, $c = 0.5$, and $\alpha$ would be 0.4.

Fig. 12. $G_1 \cup f(G_2)$. The nodes $C_1$-$C_3$ serve as shared attributes and
make the graph union possible.

Fig. 13. $G_1 \uplus f(G_2)$.

To  evaluate  privacy  protection,  we  first  need  a  technique  to 
create  an  adjacency  matrix  using  the  SHRINK  inference  graph. 
Since  the  privacy  hiding  goal  is  based  on  hiding  reachability 
between  hardware  components,  we  populate  the  elements  of 
the  adjacency  matrix  with  the  SRGs.  For  each  SRG,  all  of  the 
parents  of  the  observation  nodes  that  the  SRG  has  an  edge 
to are considered adjacent to the SRG. For example, $R_1$ in
Fig. 7 is adjacent to $R_1$, $P_1$, and $R_2$ via $L_1$, and to $P_1$, $P_2$, and
$P_3$ via $L_2$. In building the adjacency matrix for the digest,
we cannot use $L_u$ to establish adjacency since the parent nodes
of $L_u$ do not necessarily reach each other. We cannot use any
observation nodes for which $C_1$-$C_3$ is a parent since Blue
wants to hide reachability between $R_4$ and $R_5$ in the event of
failure across the optical mesh. Since there are no observation
nodes  available  to  establish  adjacency,  privacy  protection  for 
the  sensitive  property  is  trivially  satisfied. 
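
The adjacency construction and the transitive closure check can be sketched as below. The graph, the node names, and the `usable_obs` parameter (the observation nodes an adversary may legitimately use to infer adjacency) are illustrative assumptions; the closure is computed with the standard Warshall procedure.

```python
def adjacency_from_graph(edges, usable_obs):
    """SRGs sharing a usable observation node are treated as adjacent."""
    parents = {}
    for srg, obs in edges:
        if obs in usable_obs:
            parents.setdefault(obs, set()).add(srg)
    nodes = sorted({srg for srg, _ in edges})
    adj = {a: {b: False for b in nodes} for a in nodes}
    for group in parents.values():
        for a in group:
            for b in group:
                adj[a][b] = True
    return adj


def transitive_closure(adj):
    """Warshall's algorithm: reach[i][j] is True if j is reachable from i."""
    reach = {a: dict(row) for a, row in adj.items()}
    for k in reach:
        for i in reach:
            if reach[i][k]:
                for j in reach:
                    if reach[k][j]:
                        reach[i][j] = True
    return reach


# Hypothetical digest in which two down links expose a path R4 - R6 - R5.
edges = {("R4", "L7"), ("R6", "L7"), ("R6", "L8"), ("R5", "L8"),
         ("C1", "L4"), ("R4", "L4")}
exposed = transitive_closure(adjacency_from_graph(edges, {"L7", "L8"}))
hidden = transitive_closure(adjacency_from_graph(edges, set()))
```

With both links usable, `exposed["R4"]["R5"]` is true and the sensitive reachability leaks; when no observation node may be used to establish adjacency, `hidden["R4"]["R5"]` stays false.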

Suppose that the digest for Blue included the IP links $L_7$
and $L_8$ as down observation nodes. By the algorithm (Fig. 3),
these nodes and all SRGs that can affect them will appear
in the digest. Constructing the adjacency matrix as above
and computing the transitive closure will clearly reveal the
sensitive property. The algorithm may be refined so that, among
the SRGs reachable by both $R_4$ and $R_5$ in the closure, the one
with the lowest inference result using the undigested graph is
pruned from the digest. This process can be repeated and
checked until $R_4$ and $R_5$ are no longer reachable from each
other in the closure matrix. Clearly accuracy may suffer from
such a heuristic, further highlighting the tension between
accuracy and privacy.

V.  CONCLUSION 

Network  faults  often  have  observable  effects  in  multiple 
domains.  This  paper  demonstrated  that  cross-domain  fault 
localization,  by  correlating  the  observations  from  different 
domains, has the potential to significantly increase the accuracy
of network fault localization. It also articulated the main
challenges to realize the inference gain, particularly the privacy
consideration. The main contribution is an inference-graph-digest
based formulation of the problem. The formulation not
only explicitly models the inference accuracy and information
hiding requirements, but also facilitates the re-use of existing
fault localization algorithms without compromising each
domain’s  information  hiding  policy. 

Moving forward requires a fundamental understanding
of the issues that goes beyond the formulation and
scenario described in this paper. Is the graph digest approach
applicable to a wide range of network scenarios? What about
scenarios involving more than two domains? Does there exist
a general, yet easily calculable metric for quantifying the
highly domain-specific information hiding policy? How should
observation errors and graph model inaccuracies be detected
and controlled? We believe these and other similar questions
constitute a new area of networking research which may have
a major impact on how we troubleshoot network faults.